Content based network model with duplication and divergence 
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We construct a minimal content-based realization of the duplication and divergence model of 
genomic networks introduced by Wagner [A. Wagner, Proc. Natl. Acad. Sci. 91, 4387 (1994)] 
and investigate the scaling properties of the directed degree distribution and clustering coefficient. 
We find that the content based network exhibits crossover between two scaling regimes, with log- 
periodic oscillations for large degrees. These features are not present in the original gene duplication 
model, but inherent in the content based model of Balcan and Erzan. The scaling exponents 71 
and 72 = 71 — 1/2 of the Balcan- Erzan model turn out to be robust under duplication and point 
mutations, but get modified in the presence of splitting and merging of strings. The clustering 
coefficient as a function of the degree, C(d), is found, for the Balcan-Erzan model, to behave in a way 
qualitatively similar to the out-degree distribution, however with a very small exponent ai = 1 — 71 
and an envelope for the oscillatory part, which is essentially fiat, thus 02 = 0. Under duplication 
and mutations including splitting and merging of strings, C{d) is found to decay exponentially. 

PACS Nos: 02.10.Ox,89.75.Da, 89. 75. He 
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I. INTRODUCTION 

Biological networks have recently been the subject of 
many theoretical studies The gene-duplication 

model introduced by Wagner [a, IE SI for protein inter- 
action provides a biologically motivated stochastic mech- 
anism for the emergence of a small-world, scale free net- 
work Q, with the topological properties depending on 
just one parameter. This parameter may be adjusted to 
obtain values of the scaling exponent for the degree dis- 
tribution, which may be compared with those reported in 
the literature for the genomic networks. There is a wealth 
of evidence pointing to the fact that a major portion of 
the genomic network in many organisms in fact originate 
from the duplication of transcription factors (TF) and 
target genes JJ], 11, 12, 13j. 

Wagner's mathematical model 0, Q involves nodes, 
corresponding to genes, and edges, corresponding to ge- 
nomic interactions, connecting the nodes. Duplication of 
the nodes results in the inheritance of all of their previ- 
ous connections. It is postulated that there exists a fi- 
nite mutation rate which causes the edges to be removed, 
with some fixed probability. In this minimal model, once 
a number of interactions are in place, evolution by gene 
duplication and mutations generically drives the network 
to a scale free topology. Natural selection can then act 
on this spontaneously arising scale free network. 

The purpose of the present study is to associate the 
point-like nodes of the Wagner model with binary strings 
of different lengths, and investigate the behavior of the 
network under fixed rates of duplication and mutation of 
these strings. For this, one must i) postulate a rule for 
the connectivity of the network, ii) specify the particu- 
lar ways in which duplications and mutations are to be 
affected. 

The connectivity rule which we will adopt was origi- 
nally proposed by Balcan and Erzan |0| for a content- 
based null model for genomic networks. In this model 



(henceforth, the BE model) the connectivity is decided 
by the similarity of the strings associated with each node. 
This model is striking in that the content based connec- 
tivity rule gives rise to a spontaneous self-organization 
of the genome into a network with a definite scaling be- 
havior [Til Il5| . The suggestion is that the assembly of a 
linear code of sufficient length already provides the pos- 
sibility for complex inter-genomic interactions based on 
homologies, which may later be fine tuned by natural 
selection. 

In the present paper we will combine the inputs from 
the two models, the Wagner model and the BE model, 
to investigate how the content based interaction model 
evolves under duplications and mutations. We will con- 
centrate on the out-degree distribution to characterize 
the topology of the network. We will also give an ah 
initio derivation of the clustering coefficient for the BE 
model, and investigate how its scaling behavior 0, 
changes under duplications and mutations. 

It should be mentioned that there have recently been 
several studies of models of gene regulatory networks on 
"Artificial Genomes" (AG) [l9| based on various alpha- 
bets and matching rules. The Wagner model has been 
simulated with strings mimicking genes, promoter se- 
quences and transcription factors, which evolve under 
duplication and divergence due to mutations 
These derivative models employ different degrees of "real- 
ism" in their schematization of the interactions involved. 
In this paper we will stick to the simplest possible rules of 
perfect matching between binary strings associated with 
each node, for the establishment of the interaction ma- 
trix. Moreover we will not make any distinctions between 
regulatory sequences, genes, or TFs, keeping the model 
to a minimum, so that it may eventually be amenable to 
analytical treatment. 

In section 2, we outline the original BE model for com- 
pleteness, and give a summary of its properties. In sec- 
tion 3 we define an out-clustering coefficient, and report 



2 



on our analytical results for its scaling behavior as a func- 
tion of the out-degree. In section 4, we introduce the 
protocols for duplication and mutation, which we have 
adopted in the present study. In section 5, we report the 
results of our simulations. Section 6 provides a discus- 
sion with an overview of other content based models of 
duplication and divergence. 



II. THE BE MODEL 



The BE network 14] is defined via the following rules. 

1. Consider an alphabet consisting only of the sym- 
bols {0, 1, 2, . . . r}, with the character "r" always 
signifying the delimiter between successive words. 
For convenience, we assume that an "r" has also 
been placed at the 0*'' and [L + 1)*'' positions of 
the whole sequence. (In the original BE paper, and 
in most of the rest of this paper, r ~ 2.) 

2. Form a linear string C (our "Artificial Genome") of 
length i, by randomly choosing letters from this al- 
phabet, according to a predetermined distribution. 
We will think of this string as a one dimensional 
lattice and speak about the "fcth site" or the "fcth 
member of the sequence," interchangeably. 

3. Associate the substrings Gi intervening between 
the ith and i + 1st delimiters, with the i -\- 1st node 
of a graph. (Thus the strings Gi are formed from 
the characters {0, 1, . . . r— 1}, therefore an alphabet 
of length r). 

4. The connectivity rule: Postulate a directed edge 
to exist between the nodes (?,j) if and only if the 
string Gi associated with the ith node occurred at 
least once in the random word Gj associated with 
the jth node, i.e., Gi C Gj . This "inclusion rela- 
tion" determines the connectivity matrix Wij . Thus 



1 if Gi C G, 



(1) 



and is zero otherwise. Note Wij ^ Wji unless the 
lengths of the strings associated with the nodes i, j 
happen to be equal. Clearly Wij ~ for Ij < U. 

In the two previous papers [l^lTsj l. we chose the follow- 
ing distribution function for the independently and iden- 
tically distributed random variables x occupying each of 
the positions along the string C, 

1 '■"^ 

P{x)=p5{x-r) + {l~p)-^5{x-m) . (2) 

Note that the resulting set of words obey the following 
sum rules. 



(3) 



and moreover, 

{t)=p-^-l, {n,)=Lp\\ {N)=Lp. (4) 

where ni is the number of nodes with associated words of 
length I. Note that the notation (. . .) signifies averages 
over many different realizations of C. 

The topological properties of this model have been ex- 
tensively reported in two previous papers 0, ^| , so we 
will only give a brief summary here. For the network to 
consist of a large connected cluster, the frequence of the 
delimiters should be large enough; equivalently, the aver- 
age word length should not be too large. This introduces 
a threshold for p, such that the network is connected for 
P > Pc(L), where Pc decreases with L 0|. Here we will 
assume Pc < p <C 1. We checked that the properties of 
the network depended only very weakly on p as long as it 
is small. The connected cluster was found [Ijl to be of the 
extremely small- world type 0, 0, 3 , with the supremum 
of the smallest distance between any given pair of nodes 
being independent of the size of the network, and an aver- 
age clustering coefficient which is much larger than that 
expected for Erdos-Renyi random networks p6|. 

We have performed both simulations j3 ^tnd analyti- 
cal calculations |lH to determine the expected out-degree 
distribution, namely the average degree distribution com- 
puted over a large ensemble of independent realizations 
of the sequences C of identical length L. We find that 
those nodes with high out-degrees d are associated with 
short strings, of length ^ = 1,2,3,..., and that for small 
p and finite L the degree distribution is Poisson about a 
mean 



d, = 



N 



(5) 



p + qz^ 

for each /. We have defined z — 1/r and q ^ 1 ~ p. The 
width of these peaks are given by 

(6) 



af = di 



for p ^ 1 and for finite L. The spacing between the peaks 
shrinks exponentially with I, while their width shrinks ex- 
ponentially with 1/2, so that for I larger than a critical 
value, the peaks merge to yield a smooth distribution. 
Thus there are two distinct regimes in the out-degree 
distribution. For relatively small I, where there is a log- 
periodic oscillatory behavior with discrete peaks, the en- 
velope of the log-periodic oscillations E{d) behaves as 



'^(d/Nr^^ 



with 



72 



1 In z — In g 

2 In z + In g ~ 2 



P 
In z 



(7) 



(8) 



It should be noted that if we allow L ^ oo (and thereby 
N — > oo), the finite size scaling corrections no longer 
apply. Then, the variance of the peaks is given by 

-^'l-g(l-z')^ ' 



(9) 
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for small I, which is smaller than the finite size result, 
and the envelope no longer decays with very large d, but 
instead grows, becoming higher and sharper for smaller 

I. M 

As iV or L are allowed to increase without bound, we 
find that 

E°°{d) = E{d){l + aid^' + a2d^^ . . .) (10) 
where the exponents are 
In 2; 



5i 



In z — In q 



2Si 



(11) 



Note that the area under the I = 1 peak in fact counts 
the total number of non-null nodes in the network. 

In the large I, small d regime, the distribution scales 
with a different exponent, 

P{d) - d-'f^ , (12) 

for which our analytical calculations yield 



1 

71 = ^+12 



(13) 



The in-degree distribution is more localized, and also 
displays a discrete structure for small degrees, while the 
large degree tail decays like a Gaussian. 



III. THE CLUSTERING COEFFICIENT 

The clustering coefficient C{d) [H [3 

measures the 

probability that two nodes which share a neighbor of de- 
gree d are connected between themselves, and is defined 
as 



Cid) 



^ n{d) 
''^)^did~l)' 



2ei 



(14) 



where n{d) is the number of nodes with degree d, the sum 
is over all nodes of degree d, and ci is the number of edges 
which connect the neighbors of the ith node of degree d. 
The brackets again signify averaging over an ensemble 
of independent realisations. Clearly, if we multiply this 

quantity with ^ 2 ^ obtain the average number of 

triangles which which pass through nodes of degree d. 
Therefore C{d) is of interest in discussing the prevalence 
of different motifs (l/j within a given network. It has 
been conjectured that this quantity scales with d, such 
that C{d) ~ with a being called the "hierarchical 
exponent" p^H^. 

We find that a more interesting quantity in our context 
is the clustering coefficient defined with respect to the 
out-neighbors of a node with out-degree d. We have. 



Coutid) = { 



1 



n-out (d) 



nout(<i) 

E 



2e,; 



d{d-l)' 



(15) 



Note that, due to the transitivity of the connectivity rule 
WabWbc = 1 implies Wac = 1- This provides an impor- 
tant simplification in the computation of Cout{d)- From 
now on we will drop the index "out," and mean (|15|l by 
C{d). 
We have, 



L k2 ki 



^(^) = E EE ^(^i'^)^'(^i' ^2, 10 p(fci)p(fe) . (16) 

k2 ki l—l 

Here, P{l\d) is the probability that a node with degree 
d has the length I, and p{ki) is the probability that a 
neighbor of such a node has the length ki. Using P{d\l), 
the probability that a string of length / has the out-degree 
d, we may write. 



pm = 



P{d\l)p{l) 

p{d) 



(17) 



where p{l) — ni/N is the probability of encountering a 
string of length I, P{d) is the probability of a node of 
out-degree d, and 



p{ki 



(18) 



Finally, p{ki,k2,\l) is the probability that a string of 
length ki is found inside a string of length ^2 > , given 
that the same string of length I < ki is found in both. 
Note that, in obvious notation, 

p{ki, k2, \l)p{l E ki,l e k2) = p{ki E k2,l E ki,l E k2) . 

(19) 

Due to the transitivity relation stated above, the right 
hand side reduces to p{ki E k2,l E ki). Using the fact 
that the joint probability on the left hand side must fac- 
torize, since each string is constituted independently, we 
have. 



p(fci, fc2, 10 = p(fci E k2)/p{l E fcz) 



(20) 



The quantities p{l E k) = p{l, k) are simply the probabil- 
ity that a random string of length I is found in a string 
of length k > I. These quantities have been computed in 
jj^^j as 



p{l,k) = l-{l-r-') 



k-l+l 



(21) 



for strings drawn randomly from an alphabet of length 
r. We also have from that, for small d (large /), 



Pid\l) 



dfe-"^' 
d\ 



(22) 



where di is given by (O and P{d) by l(T^ . 

We would now like to find the scaling behavior of C{d). 
For small d, substituting all the relevant quantities in 
(|16|l . one finds that the Z-sum may be performed using 
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a saddle-point approximation, and is independent of the 
upper limit, so that the subsequent sums over ki, k2 lead 
to constants. After some algebra we find, 

C((i) ~ const, d^^^^'^i) ~ , (23) 

where 



with the approximation being valid for small p, giving 
the numerical value, for r = 2, p = 0.05, of ai = 0.07. 

In the region of large d (small I), we have already seen 
that the distribution of nodes with out-degree d breaks 
up into discrete peaks. For each I = d'f^{d), where the 
inverse function is meant by the power —1, we have the 
height of the peak in the attendant C (d) distribution to 
be 

cm] = E E ^'(^i' \i)p{k,)p{k2) . (25) 

k2—l ki—l 

It is actually possible to sum this series to lowest order in 
(fci — I) /I and the leading term is simply independent of 
I, and therefore of d. Thus, the envelope of these peaks 
behaves as ~ d""^ ^ const., or, in other words, 

a2 = . (26) 

These results reveal that the out-clustering coefficient is 
essentially independent of d for all practical purposes, for 
both small and large out-degrees d. 

The simulations results completely agree with these 
findings, as will be shown in Section 5. 

IV. CONTENT BASED MODEL FOR THE 
EVOLUTION OF GENOMIC NETWORKS 

In this section we define our network model with dupli- 
cation and divergence, based on the original BE model 
(Section 2), with the different algorithms for evolution 
under duplication and various kinds of mutations. To 
avoid confusion, apart from standard terminology, we will 
reserve the term "sequence" for the complete sequence C 
of length L. The subsequences Gi of C, which appear 
between the delimiter marks, we will call "strings," or 
"words." In the rest of the paper, we will restrict the 
model to r = 2, so that the strings (words) to be asso- 
ciated with the nodes of the network are binary strings 
whose elements are € {0, 1}, and the delimiters between 
the words in the sequence C are indicated with the sym- 
bol 2. In the course of the evolution of the sequence, a 
number of delimiters may aggregate at successive sites, 
seeming to bracket strings of zero length. These "null 
strings" are not associated with any nodes and ignored 
in the construction of the network. The connectivity rule 
will be the same as that of the BE model, given in Eq.QJ 
above. 



A. Mutations 

At each time step for each fcth member Xk of the 
complete sequence C, we decide with a fixed probability 
/i whether it will suffer a mutation, independently of the 
other members. 



Mutation rule Ml 

1. If Xk G {0, 1}, then we change Xfe to aJfc = 1 — Xk- 

2. If Xfc — 2, then we exchange it with the symbol to 
either its right or left, with equal probability. 

Clearly item (1) corresponds to a substitution in the 
code, while item (2) corresponds to a shift in the reading 
frame starting at the delimiter site. We have checked that 
on the average, it makes no difference whether the sites 
i to be mutated are picked at random or sequentially. 

Extensive simulations which we reported in show 
that the BE model is robust with respect to mutations 
Ml. In fact, applying the set Ml a large number of times 
may be used to generate independent random sequences 
C from a given initial sequence. Note that Ml strictly 
conserves the length L. 



Mutation rule M2 

A more complicated algorithm M2, consists of a set of 
three different operations, insertion, deletion, or replace- 
ment. Once it has been decided to mutate a site fc, one 
of these operations is picked with equal (1/3) probability, 
independently of the value of Xfc. 

1. Insertion. A character, picked with equal probabil- 
ity from the set {0, 1,2}, is inserted to the right of 
the kih. site, coming to occupy the k + 1st site. 

2. Deletion. The fcth element is deleted. 

3. Replacement. The fcth element is randomly re- 
placed by either or 1 with equal probability. 

Note that M2 preserves L on the average, since the op- 
erations of insertion and deletion are picked with equal 
probability, and replacement does not change the number 
of sites. On the other hand, the insertion of the delimiter 
(i.e., the symbol 2) will cut a word into two successive 
words; the deletion of a delimiter may merge two suc- 
cessive words into one. Moreover, the replacement of a 
delimiter by either a or a 1 will again make one word 
out of two neighboring strings. Thus, the implementation 
of M2 can split or merge pairs of nodes of the underlying 
graph, changing the total number of nodes. Nevertheless, 
the number of nodes is also conserved on the average in 
the long time behavior, leading to a steady state with a 
conserved average string length {I). 
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B. Duplication 

Finally, we postulate that at each time step, a ran- 
domly chosen string d is duplicated. The duplicated 
string may then be inserted in tandem with the original 
string, in the original or in reversed order, or inserted at 
a random point somewhere in the sequence C. With 
equal probability, one of the following options is chosen: 

1. Tandem insertion. The copied string is inserted 
right after the original, in the same order. With 
probability 1/2, a delimiter is inserted between the 
two copies, and with probability 1/2, they are run 
on without the delimiter (which only appears at 
the end of the inserted copy). Thus, with equal 
probability, either a new node is created, or the 
the word associated with the ith node is doubled 
in length. 

2. Reverse tandem insertion. Same as above, except 
that the polarity of the string is reversed before it 
is inserted. 

3. Random insertion. The copied string is inserted at 
a random point in the sequence C without any de- 
limiters on either side of it. This does not give rise 
to a new node, but only modifies the word associ- 
ated with an existing node. 

After transients, duplication leads to a steady increase 
in the number of strings for both sets of mutation rules. 
The long time, large I behavior of the string length dis- 
tribution can be obtained analytically for various cases, 
and is given in the Appendix. 



C. Initial conditions 

The probability distribution according to which the ini- 
tial sequence C will be formed may also be picked in dif- 
ferent ways. One choice, which we specified in the previ- 
ous section, Eq.|2Jl, results in an exponential distribution 
of string lengths. 



(27) 



where ^ ~ 1/p for small p, and corresponds to a random 
genome. 

A second choice is motivated by considering the strings 
with which to associate the nodes, to stand in for the 
so called promoter sequences in the gene regulatory net- 
work. In this case, inspection of the data banks publicly 
available on the internet 24| shows that the lengths of 
the consensus sequences ^25, 26,j are distributed in a much 
narrower fashion, around a central peak. This distribu- 
tion we have mimicked with a Gaussian, 



PS) 



1 



%/27rCT2 



(28) 



The procedure by which the sequence is generated in this 
case is that the lengths U of the successive strings are 
picked from the distribution in Ea. (|28(l . the delimiters 
are inserted at the sites ki — J^jKii^i + 1) finally 
the rest of the sites filled with the symbols and 1 with 
equal probability. 

We have used the above sets of rules to construct two 
models, which combine them in different ways. Model I 
comprises the mutation set Ml, with or without dupli- 
cations, and with either sets of initial conditions. Model 
II comprises the mutation set M2, again with or without 
duplications, and with either set of initial conditions. 



V. SIMULATION RESULTS 

In this section we summarize our simulations results 
for the out-degree distributions and the clustering coef- 
ficients for the out-neighbors. All simulations were done 
on sequences of initial total length Lq — 1.5 x 10*. The 
value of p = 0.05 was chosen for those simulations with 
exponential initial length distributions (and therefore on 
the average 750 random initial sequences). The param- 
eters of the Gaussian distribution are given by Iq = 15 
and (7 = 2, while the number of strings (or the number 
delimiters) for the Gaussian initial length distributions 
was TVq = 700. The value of the mutation probability 
was chosen /i = 0.05. Averages were performed over 500 
realizations in each case. 



A. The out-degree distribution 

The results from the two models should coincide for 
t=0, for the two different initial conditions. In Fig.l 
the plot (P) for the exponential length distribution of 
course reproduces the results from the original BE model, 
and one can clearly discern the discrete and continuous 
regimes. The plot (U) for the Gaussian length distribu- 
tion shows less structure, since the paucity of very small 
strings shortens the discrete part of the plot, and gives it 
the overall appearance of just one power law distribution 
with 7 ~ 2, with some statistical scatter toward large 
degrees. 



We should note that the length L of the sequence C, 
as well as the number of nodes, N, and the average word 
length, (/), changes with time under duplication. Since 
there is a subtle interplay between the different kinds 
of mutation and duplication, this change is not uniform. 
For Model I we find (l), N and L stay constant under 
Ml, and they all grow linearly when duplications are in- 
troduced. In Model II, (l) relaxes exponentially fast to a 
relatively small constant in all cases, during which time 
iV grows nonlinearly and saturates to its long time behav- 
ior. For longer times N then behaves like i) a constant, 
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FIG. 1: (Color online)The out-degree distributions for ex- 
ponential (P) and Gaussian (U) initial length distributions, 
at time t=0, shown on a double logarithmic plot. For the 
random sequence (P), the exponent for the initial putative 
scaling region and the envelope for the oscillatory regime are 
indicated on the plot. The crossover between the two regimes 
have been indicated by a vertical dotted line. The Gaussian 
length distribution (U) gives rise to an overall slope of ~ —2. 



without duplications, and ii) grows linearly with duplica- 
tions,while L follows suit, with L ~ N{l).{See Appendix) 
In both models, the successive application of the mu- 
tation and duplication algorithms randomize the distri- 
bution of the delimiters within the sequence. Model II 
converges to a steady state exponential distribution of 
string lengths, both in the absence and presence of du- 
plications. Model I does so in the absence of duplication, 
where the rate is determined by the random walk step 
executed by each delimiter, at each time step. (Note 
that double occupancy of any site by any character is 
prohibited, so that the problem becomes that of one di- 
mensional diffusion with exclusion.) In the presence of 
duplication, the population at longer string lengths grows 
by insertion of the copied string without the delimiters 
and also with random point insertions, but the dynam- 
ics is very complicated and very persistent transients are 
present. The set of mutations M2, employed in Model 
II, comprising insertions, deletions and replacements, is 
more effective in randomizing the positions of the delim- 
iters. Below we give the results for five hundred time 
steps, where subtle differences arise due to the contin- 
ued presence of transients in the simulations of sets with 
Gaussian initial length distributions. 



The result of 500 time steps of evolution for Model I 
is shown in Fig. 2. It is remarkable that the presence or 
absence of duplications alters the scaling behavior very 
little - the plots labeled PMl and PDMl, signifying ex- 
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FIG. 2; (Color online) The out-degree distributions for Model 
I, after t=500 steps. Ml mutation rules apply, (a) Exponen- 
tial (PMl, PDMl) and (b) Gaussian (UM1,UMD1) initial 
length distributions, with and without duplication (D), are 
shown on a double logarithmic plot. See text. 

ponential initial length distribution, Ml mutation rules, 
with and without duplication (D), falling almost right on 
top of each other. The effective p value computed from 
Ea. (|27|) is identical for the two graphs and the same as 
the initial value oi p — 0.05. The Gaussian length distri- 
butions lead to results that are almost indistinguishable, 
even though an inspection of the length distributions at 
t=500 (not shown), reveals that they have not yet con- 
verged to exponential distributions, but show a marked 
peak displaced a little to the left of Iq, with, however a 
growing exponential tail for large I. The only marked 
difference is between the last couple of peaks of the UMl 
and UMDl plots, seen more clearly in the linear plots 
(Fig. 3) of the same quantities as in Fig. 2b. We see 
that the UMl simulation, which, without duplications, 
is much less deformed from the original Gaussian distri- 
bution in string lengths, and has a smaller large-Z tail, 
displays a much sharper peak at large /. This corre- 
sponds to the much smaller dispersion in the out-degree 
distribution of the nodes with the smallest strings. 
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In Fig. 2a, we see that the introduction of duplication 
does not alter the scaling exponents from the original BE 
values, with 71 — 0.94 and 72 = 0.43 falling right on top 
of the previously reported values for the BE model. The 
exponent values for the initial Gaussian length distribu- 
tions, shown in Fig. 2b, do not show any difference with 
respect to the presence or absence of duplications. We 
find 71 = 1.07 and 72 = 0.57. 



-•-UM1 
-T-UDM1 




FIG. 3: The linear plot of the out-degree distribution for 
Model I with Gaussian (U) initial length distributions, and 
Ml mutation rules, after t=500 steps. With (D) and without 
duplication. 



Finally, in Fig. 4, we show the scaling behavior of 
Model II. The graphs shown are very similar to those 
without duplications, and therefore we omit the latter. 
We observe that there is a significant change with re- 
spect to Model I in the exponents, especially in 71. The 
values are given in Table I. The first two columns are 
the values of 71 and 72 as found from linear fits to the 
double logarithmic graphs. The third columns in the ef- 
fective p values found from Ea. (|27|l . from an inspection 
of the exponential length distributions achieved by the 
random sequences. However, the values 7"^^ found for 
the exponents upon the substitution ofp"^^ in Eas. (|8ll3|l . 
as shown in the last two columns, are somewhat different 
from the simulation results, especially for the oscillatory 
regimes. On the other hand, these same values, although 
detectable only over a very restricted range of In d, stay 
rather close to the original BE results for 72. 

Clearly, the assumption of statistical independence of 
the strings associated with different nodes, which is em- 
ployed in the derivation of these equations, does not ap- 
ply in this case. What is remarkable, is that the muta- 
tions M2 seem to play a much stronger role in leading 
to correlations between nodes of the underlying network. 
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FIG. 4: (Color online) The out-degree distributions for Model 
II with exponential (PDM2) and Gaussian (UDM2) initial 
length distributions, after t=500 steps, shown on double loga- 
rithmic plots. The corresponding graphs without duplications 
are very similar. 



TABLE I: Scaling exponents for Model II after t=500 steps of 
mutation (M2), with or without duplication (D). P indicates 
a random genome, with an initial exponential distribution 
of word lengths, whereas U stands for an intially Gaussian 
distribution. 





71 


72 




7f 


72 


PM2 


0.80 


0.41 


0.147 


0.81 


0.31 


UM2 


0.76 


0.43 


0.141 


0.82 


0.32 


PDM2 


0.80 


0.47 


0.145 


0.82 


0.32 


UDM2 


0.79 


0.41 


0.14 


0.82 


0.32 



than do duplications. 

In Fig. 4 and 5, we see that the increase in the num- 
bers of nodes combines with the relatively short average 
string length in Model II, to remove the finite size effects 
which dominate the behavior observed for the BE model 
and for Model I. As shown by an explicit calculation ( [Tsf 
and Section 2), that the high out-degree distribution for 
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the infinite sequence, i.e., for the case L oo, the hight 
of the peaks for successive out-degrees should in fact in- 
crease with d, and this is what we observe in Fig. 4 and 
Fig. 5. 

For the very large values of the degree d, we observe an- 
other effect which can be called the fine-structure of the 
degree distribution, which has been discussed by Bilge et 
al. [23|, who have shown that the exact probability for 
a sequence of length I to be found in another sequence 
of length I' is determined by a number called the "shift 
match number" of the first sequence. (For any given bi- 
nary sequence of length I, representing a number a, the 
shift match number s{a) is given by the binary sequence 
Si{a), i = 0, . . . ,1 — I, where Si{a) = 1 if the subsequence 
fli+i, ... a; is congruent to ai, 02, . . . a;_i, and zero 
otherwise.) Each of the peaks corresponding to different 
string lengths I split into distinct peaks according to the 
shift match numbers, as the number N of strings in the 
sequence C becomes very large, so that the true degree 
distributions of at least the very short strings are suffi- 
ciently sampled. The leading peak on the right in Fig. 
5 corresponds to / = 1, the twin peaks on its left corre- 
spond to I — 2, while the next three peaks correspond to 
I = 3. 



results for Model II, for which the exponents for the out- 
degree distribution show a quantitative departure from 
those of the BE model. Once we turn on the mutations 
and duplications, the behavior is changed qualitatively, 
to an exponential decay with d, exhibiting an effect which 
one may call "out-neighbor repulsion." 
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FIG. 5: (Color online) The out-degree distributions for Model 
II with duplications, after t=500 steps, shown on a linear 
plot. We have plotted d/N on the horizontal axis, to bring 
the networks with different A'^ into coincidence. 



B. The clustering coefficient Cout(d) 

In FigEl we have plotted the out-clustering coefficient 
Cout(c^), Ea. l(T^ . as a function of the out-degree. We 
see that our analytical calculation for the BE model is 
born out very well, with the slopes of either the initial 
continuous regime, or the envelope of the oscillatory re- 
gion, being essentially negligible. In Fig^lD we report the 



-•-PDM2 t=100 

.1.5 -T- PDM2 1=200 




' 50 160 150 200 250 300 
d 



FIG. 6: (Color online) The out-clustering coefficient for the 
BE model (upper panel) and Model II, lower panel, with dupli- 
cations and mutations. The results have been averaged over 
200 realizations. 

It is possible to gain a heuristic understanding of this 
exponential decay, or repulsion, by considering the fact 
that in Model II, at each time step one string out of is 
duplicated (the copy automatically inheriting all previ- 
ous connections by our definition of connectivity), while 
strings at both ends of a bond, say of lengths I and k will 
suffer, on the average, iil and mutations. Clearly the 
mutations are driving the strings apart at a faster rate 
than they are being duplicated. For a hub of length I to 
stay connected to a string of length k after one time step, 
one needs //fc ^ 1, so that those mutations that do occur 



9 



have a greater probability of taking place outside of the 
matching zone. On the other hand, for out-neighbors of 
length ki and ^2 to stay connected to each other becomes 
less probable, the longer they are, since they will suffer 
mutations with greater probability. This drives the con- 
nected clusters to have hubs with relatively small I, with 
out-neighbors of commensurable lengths I. 

In the discrete region of the out-clustering coefficient, 
which is still vaguely identifiable in Fig. 6, it is easier 
to see what is going on. By going back to Ea. (|^ . 
we observe that the lower limits of the sums must be 
effectively replaced by some A » ^, according to the 
above argument. Moreover, in Eq.jSDj), the numerator 
can be well approximated in this limit by p(ki,k2) — 
(^2 — ki + while in the denominator, we have 

p{l,k2) = l-{l-zY^-^+^ ~ exp[-(l-zO'=""'+^], where 
we have used the fact that the expression in the paren- 
thesis is reasonably small for / small. This expression 
raised to ^2 — Z 3> 1 becomes much smaller than unity, 
so that the series can be exponentiated. Doing this, and 
once more employing the trick of substituting, in the re- 
sulting expressions, the / associated with the peak values 
of d, namely / = hi{pd/N)/ \ii(qz), we find that 

C(d) = /A(q,z)e-«'^'" , (29) 

cff 

where ^ = [p/N]''^ A and the effective value of p (see 
Table I) should be used in 7^*^. The predicted behavior 
is not exactly exponential, but with d in the exponential 
having a power close to unity. Although this argument 
is not at all rigorous, if we estimate C from FigElto be 
0(10-2), and using p''^ = 0.15, {N) = 2500 at t = 200 

eft' 

from our simulations, we find A — ({p/N)~^^ ~ 24, 
which would be very much in the right ballpark, for I ~ 
0(1). 

VI. DISCUSSION 

In this section we would first like to give a brief 
overview of some selected papers where different artificial 
genome models have been studied. We then summarize 
our results. 

Geard and Wiles have used the AG proposed by 
Reil jl^, with a four letter alphabet, incorporating genes, 
TPs, and binding sites, or regulatory sequences. In their 
version, the mRNA reproduce a beginning segment (of 
given length) of the genes, with complementarity being 
interpreted as identity. The mRNA is translated to a 
shorter sequence (an artificial protein, or TF), in a many- 
to-one mapping, once more in the same 4 letter alpha- 
bet. The network is established by matching the proteins, 
which are all of the same length, to subsequences to be 
found within the regulatory sequences upstream of the 
genes and separated from them by delimiter sequences 
(the so called TATA boxes) . They have also introduced so 
called sRNAs (short RNA's which bind directly onto the 
regulatory sequences) which enable the creating of hubs 



with connections to many different genes. Their results 
regarding the degree distribution of the regulatory net- 
work are not very realistic, in spite of the great pains they 
have taken to model the various levels of transcription 
and matching in a realistic way. They find that the in- 
degree distribution is exponential, while the out-degree 
distribution is also exponential but with superposed os- 
cillations reminiscent of those found in the BE model and 
the present study. The reason they are able to observe 
these oscillations coming from the discrete matching se- 
quence lengths, is because they average over many real- 
izations of the artificial genome. 

In the paper where Watson, Geard and Wiles im- 
plement duplication and mutation with biologically con- 
vincing mutation operators on the AG of Reil, they un- 
fortunately do not report on the degree distribution, but 
only the average connectivity and clustering coefficient 
of the genomic network. Predictably, these quantities do 
not vary much as the network evolves, except that there 
seems to be a small downward trend in the average clus- 
tering coefficient. 

Finally, van Noort, Snel and Huynen [211] have pro- 
posed a model, where they use duplication and muta- 
tion to generate an AG from a small initial pool of 25 
genes and random TF binding sequences (TFBS). In their 
model, the genes and the TFBS's may be dupHcated by 
either themselves or in pairs, after which they diverge by 
independent mutations. The duplicated TFBSs may be 
randomly inserted upstream of other genes. The TFBSs 
are of fixed length, shorter than the genes. These authors 
have considered the coexpression network, constructed 
from sequence matching between TFBSs. The degree dis- 
tributions for various relative frequencies of duplication 
and mutation obtained in this case are extremely similar 
to the BE results, with a scaling exponent of 7 ~ 1.5, 
except that they report only single realizations, rather 
than the expected distribution (averaged over many re- 
alizations), and therefore no oscillations are observed in 
their graphs. 

Banzhaf |^ has considered a binary AG, the network 
being established by complementarity between the pro- 
tein (TFs) produced by one gene, and the TFBS of an- 
other gene, both sequences being of the same length, and 
the AG being subjected to duplication and divergence 
through mutations. Instead of the degree distribution, 
Banzhaf has examined the frequency of occurrence of dif- 
ferent three-node motifs. It is extremely interesting that 
the duplication-divergence procedure prunes the distri- 
bution obtained for the random AG to bring it into much 
closer resemblance with those obtained for the Escheriche 
coli and yeast genomes. 

We have shown that constructing a content based ver- 
sion of the Wagner 0| model by introducin g dupli- 
cations as well as mutations into the BE model [ij) UM 
yields a complex network which is rather close in its topo- 
logical properties to the original BE model. The expo- 
nents in the putative scaling region observed for small 
d change little from their original values, staying near 
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unity. The behavior in the second, oscillatory regime, is 
enriched due to the growth of the number of nodes, re- 
sulting in the removal of finite size corrections for very 
small i.e., very large d. 

In the context of our content-based model, it was ap- 
propriate to introduce an out-clustering coefficient de- 
fined as the probability that the out-neighbors of a node 
are connected among themselves. We were able to give 
analytic derivations of the dependence of Cout (d) on the 
degree d of the node, for the BE model. These results 
were supported by our simulations. It should be noted 
that all the triangles contributing to this clustering coef- 
ficient which we have defined, are in the form of "feed- 
forward loops," motifs which are believed to play a dis- 
tinctive role in gene regulatory networks 17]. In the 
presence of duplications and mutations, numerical simu- 
lations showed that the putative power law dependence 
on d gave way to an exponential decay. More work is 
needed to explore the whole phase diagram with larger 
duplication rates in comparison to the mutation rates. 
The numbers that were chosen here were motivated by 
biologically observed 0, 0| rates of duplication and mu- 
tation. 
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Appendix 

Here we would like to write down the equations for the 
evolution of the length distribution, p{l,t), under differ- 
ent mutation rules, in the presence or absence of dupli- 
cations, and present solutions for the long time, steady 
state behavior. 

It is useful to make a number of definitions. Let L be 
the total length of the sequence C, consisting of the sym- 
bols {0, 1,2}. We will denote the symbol 2 as a dehmiter, 
and those uninterrupted sequences of {0, 1} as "strings." 
Let N be the number of times that the delimiter occurs 
in C. The effective density of the delimiters is given by 
Pe = N/L. 

If the delimiters are distributed over C in a random 
fashion, the probability that a delimiter is followed by a 
string of length Z > is given by 



pil)=Pe (l-PeY 



(30) 



which is correctly normalized to unity if we neglect the 
contributions coming from the finite size, and effectively 
take L — > oo. Clearly p{0) = pe. Note that if, instead of 
summing the discrete series, one goes to the continuum 
limit and integrates over /, then one obtains a slightly dif- 



ferent normalization, with pcts.(0 = aexp(— a/), where 
a = I ln(l — pe)\ — — ln(l — pe). The probability of find- 
ing an /-string (/ ^ 0) among all the non-null strings 
is Pnn = p(0/(l - Pe) = Pe(l " Pe)'"\ Or similarly, 
aexp[— a(/ — 1)] in the continuum case. 

The number of /-strings is given by nj, which obeys 



m = Np{l,t) = Lpep{l,t) 



(31) 



= L-N^{l)N = Lil-p,) , (32) 

1=0 

where p{l,t) is the time dependent length distribution 
and should not be confused with the matching probabil- 
ities p(/, k) in the text. 

Finally, define nil) to be the probability that a ran- 
domly picked site belongs to a string of length / > 0. 
Then, using H31() we have 



7r(/) = ^ =pelp{l) 



E-(o 

1^0 



1 -Pe 



(33) 
(34) 



We now would like to write down the equations for the 
evolution of the length distribution, p{l,t), under differ- 
ent mutation rules, in the presence or absence of dupli- 
cations, and discuss the long time, steady state behavior. 
We assume that for large times, a mean-field type of ap- 
proximation can be made in writing down the master 
equations. 



Mutation rules Ml 

This set of rules calls for choosing a site [i) at random, 
and with probability /i, 

1. exchanging Xi with 1 — xi if Xi has the value or 1, 

2. interchanging Xi with Xi±i with equal (1/2) proba- 
bility, ii Xi — 2. 

The first case does not lead to any change in p(/, t). In 
the second case one has the following possibilities: 

a) The delimiter happens to be the left (right) bound- 
ary of a string of length /. Then, exchanging it with its 
left (right) neighbor will result in / ^ Z -I- 1, unless this 
neighbor also happens to be a delimiter, in which case 
I does not change. On the other hand, exchanging the 
position of a left (right) delimiter with its neighbor on 
the right (left) will give rise to Z — > / — 1. 

This gives rise to the variation of p{l,t), I > 0, under 
Ml, 

+ il-pe)pil-l,t) + 

p{l + l,t)} , (35) 

under the assumption that a randomly chosen site along 
the sequence C is occupied by a delimiter neighboring a 
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string of length I on the left (or on the right) with prob- 
ability PePnn('ji)- For notational convenience we have 
dropped the time dependence from p{0,t) = Pe{t), the 
density of the delimiters on the string C. 

If one chooses an initial random distribution of the de- 
limiters along C with a probability pe, one has p{l,0) = 
p{l) = pe(l — PeY one finds immediately upon substitut- 
ing into H35|l that the right hand side (RHS) is identically 
zero, since the curly brackets vanish. Thus, (|30|l is a sta- 
ble distribution under Ml. 

Other initial distributions may be explored, by noting 
that in the large I limit, Ea. (|35ll can be rewritten as 



1 dp{l,t) 



dt 



1 



Ml 



1 -Pe 



dp{i,t) , dMi.t) 



dl 



dp 



(36) 

where it is to be understood that we mean the contin- 
uous limit of p{l). This equation is in the form of a 
Fokkcr-Planck equation with drift, on the half line ^ > 0. 
The distribution is an approximate stable time- 

independent solution of (jSni, to 0[pI). On the other 
hand, if one starts with an initial (Gaussian or delta- 
function) distribution centered around /q > O7 one can see 
that the peak drifts to the origin at a rate /iPe/(l — Pe)i 
and becomes extremely narrow, while the distribution 
eventually develops a large I tail. In fact, 13611 may be 
put in the form. 



1 dpiUt) 



dt 



Ml 



a_ 

dl 



[§,m) p{i,t)-pj-^ 

x(l-p,)-i . (37) 



where the effective drift potential is given by <!>(/) = 
/ — 61n(^), and the singular logarithmic term has been 
inserted to mimic the infinite barrier at I — 0. The steady 
state solution is 

p{l)^ble-P^^ , (38) 
where b is fixed by normalization to be 6 = Pg . 

Mutation rules M2 

Similar to the considerations under Ml, we can write 
down the variation of p{l, t) due to the routines of inser- 
tion, deletion and replacement under the set of mutations 
labeled M2. 

The effect of an insertion of 0,1, or 2, with equal prob- 
ability, at a random point in C, is 



dp{l,t) 
dt 



~Pelp{l,t) ~ ^PePiljt) + 

+ lpe {l-l)p{l-l,t) + 
+ iEfi=/+lPeP(^l,i) • 



(39) 



Recall that the probability that a randomly picked site 
belongs to an /-string is 7r(/) = p^. lp{l). The first term in 
Eg. (|39|l comes from the insertion into an /-string of any 



of the three possible choices (0,1 or 2). The second term 
corresponds to the probability of inserting a or 1 into 
an (/ — l)-string and the third term corresponds to the 
probability of inserting a delimiter into a longer string in 
just the two right positions to end up with an /-string. 

The effect of deleting a randomly picked character is 
found in a similar way. 



dpjl.t) 
dt 



del. 



-Pe I P{1, t) - 2pe(l - pe)p(/, t) + 

+Pe (/+l)p(/-l,t) + 
+ Y!i;=lPeP{h,t)p{l - h,t) . (40) 



Randomly picking a character and replacing it with 
either or 1 has an effect on the /-distribution in the 
event that the replaced character is a delimiter. Thus, 



dp{l.t) 
dt 



-2pepil,t)+2plp{l-l,t) 



rep. 



1-2 



J2PePih.t)p{l-h-l,t) .(41) 



We may combine these effects in two different ways, both 
of which leave the average number of strings and the 
total length of the sequence constant, after transients. 
We may, after choosing a character at random, 

a) apply insertion and deletion with equal (1/2) prob- 
ability, or 

b) apply insertion, deletion and replacement with equal 
(1/3) probabihty. 

The analysis for either of the cases a) or b) follows 
along similar lines. One may easily show that the ex- 
ponential distribution in (j30|l always solves the steady 
state equations obtained in the large / limit, where, after 
dividing out the steady state equations by /, only the co- 
efficient of / survives and has to be set equal to zero, to 
determine the value of Pe . 

We obtain, for the case (a) pe = 1/3 and for the case 
(b), Pe — 1/6. These results are fully corroborated by 
simulations. 

One may also combine the mutation operators with 
duplication. 



Duplications 

Let r be the probability that a duplication takes place 
at any given time step. In this context, one time step is 
that interval of time where a randomly chosen element of 
the sequence C is mutated with probability fi. It is clear 
that T = l/L, since one duplication takes place only after 
L elements have been tested for mutation. 

The probability that a randomly chosen string has 
length /, is precisely p{l,t). The variation of p(l,t) due 
to duplication of this string and the insertion of the copy 
into the sequence C, in the original or reverse order, is 
2/3. With probability 1/3, on the other hand, the copied 
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string is inserted at a random point of C (to the right of 
a randomly picked site). Since the insertion can be made 
to the right of the left delimiter, as well as after any of 
the elements of the ^-string, it occurs with probability 
{I + l)p(Z,t) into an Z-string. If, however, a copied string 
of length I ~ li is inserted into a string of length Zi, in 
the + 1 possible positions, the number of ^-strings will 
be increased. Putting all of these terms together gives, 



dt 



rp{l,t) 



dupl. 



{I + 1) p{l, t)-Y, ih + 1) Pih,t)pil ~h,t) 
h=o 



42) 



It is easy to see that substituting the exponential 
ansatz in yields, for the RHS, 



RHS 



1 



TP, (1 -p,y{l - {l~Pe)l + T:Pe{l 2)} 



(43) 

where the nonlinear term in / comes from the sum in 
(|42|l . Thus in this case there is no possibility of making 
the coefficients of I vanish order by order. However, since 
T <ti fi for the interesting cases, to lowest order in t the 
solutions without duplication are still valid, and yield an 
exponential distribution. 
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